Abstract

The Intelligent Data Management (IDM) project at 
NASA/GSFC has prototyped an Intelligent Information 
Fusion System (IIFS), which automatically ingests metadata
from remote sensor observations into a large catalog
which is directly queryable by end-users. The greatest
challenge in the implementation of this catalog has been 
supporting spatially-driven searches, where the user has 
a possibly complex region of interest and wishes to re- 
cover those images that overlap all or simply a part of 
that region. 

A novel spatial data management system is described, 
which is capable of storing and retrieving records of image 
data regardless of their source. This system has been 
designed and implemented as part of the IIFS catalog. 
A new data structure, called a hypercylinder , is central 
to the design. The hypercylinder is specifically tailored 
for data distributed over the surface of a sphere, such as 
satellite observations of the Earth or space. Operations 
on the hypercylinder are regulated by two expert systems. 
The first governs the ingest of new metadata records, and 
maintains the efficiency of the data structure as it grows. 
The second translates, plans, and executes users’ spatial 
queries, performing incremental optimization as partial 
query results are returned. 

1 Introduction 

1.1 Needs of the scientific community 

With the planned launching of the Earth Observing Sys-
tem (EOS) platforms and with the continuing generation
of data by existing missions such as the Hubble Space
Telescope (HST), NASA faces one of its greatest chal-
lenges yet: the cataloging of remote-sensor data in a man-
ner that will allow users from a variety of scientific dis-
ciplines to quickly recover datasets of interest from the
vast, constantly-expanding archive.

The ability to query or browse large catalogs of im- 
age data by the spatial characteristics of desired datasets 
is central to what is referred to as the spatial
data handling problem. Whether such a catalog con- 
tains downward-looking images of the Earth or outward- 
looking images of space, the spatial data structures resi- 
dent in the catalog must support two basic spatial search 
operations required by the general scientific community 
(see Figure 1): 

• Window query: given a region of interest, find all
images that overlap the region. 

• Containment query: given a region of interest, find 
all images that completely contain the region. 

There is also a simple case of these queries, whose use 
is sometimes convenient: 

• Point query: given a point of interest, find all images 
that overlap the point. 

In addition, users require the ability to combine the 
above operations into more complex spatial queries via 
the operators AND, OR, and NOT. 
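
As a minimal sketch of how such combinations might be evaluated, assume that each basic query returns a set of image identifiers. The identifiers, the `catalog` universe, and the helper names below are hypothetical; NOT is taken relative to the whole catalog.

```python
# Hypothetical result sets; in a real catalog these would be produced by
# the window, containment, and point queries described above.
catalog = {"img1", "img2", "img3", "img4", "img5"}
window_hits = {"img1", "img2", "img3"}
containment_hits = {"img2"}
point_hits = {"img3", "img4"}


def q_and(a, b):
    """Images that satisfy both queries."""
    return a & b


def q_or(a, b):
    """Images that satisfy at least one of the queries."""
    return a | b


def q_not(a, universe):
    """Images in the catalog that do not satisfy the query."""
    return universe - a


# Images that overlap the window without completely containing the region,
# together with images that overlap the point of interest.
result = q_or(q_and(window_hits, q_not(containment_hits, catalog)), point_hits)
```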

1.2 Problems with existing approaches 

Most attempts at spatial data handling in data cata- 
logs encounter major difficulties from the start because 
the catalogs are implemented using relational database 
(RDB) packages. RDBs generally do not support data 
structures for handling anything other than linearly- 
ordered records. The object-oriented database (OODB) 
research of recent years provides a means of implementing 
spatial data structures directly inside data catalogs, and
OODB technology has thus been utilized in implement-
ing the spatial data management system described in this
paper.

Figure 1: Images A and B both satisfy a window query
on the shaded region R, but only image B satisfies a con-
tainment query on R.

Some catalogs circumvent most spatial data handling
problems by virtue of only having to deal with queries in-
volving a single instrument. By using information about
the orbit of the instrument’s platform, spatial queries are
mathematically converted into sets of path-row coordi-
nates that specify images satisfying the query, and these
coordinates are used as the search keys for the images.
The problem with this approach is the lack of both ex-
tensibility and flexibility. First, metadata from new plat-
forms and instruments cannot be added without simul-
taneously authoring new spatial-search software. Sec-
ond, images with identical path-row coordinates might
not have identical locations due to fluctuations in the or-
bit of the platform, so a path-row-based spatial query
system may falsely accept or reject images during a query.

Even catalogs that employ robust spatial data handling
techniques encounter difficulties because they actually
treat the globe not as a sphere but as a planar surface, a
consequence of employing spatial data structures that use
latitude-longitude based coordinate systems. The prob-
lem is that the surface of a sphere cannot be mapped
onto a plane without introducing discontinuities and con-
siderable distortion near the poles, as is evident in most
cartographic projections. When using planar spatial data
structures (such as quadtrees or k-d trees) to represent an
inherently spherical domain, these anomalies present ma-
jor difficulties in query processing and often result in an
“unbalancing” of the data structure, leading to an overall
drop in performance during search.


1.3 The application 

The Intelligent Data Management group is conducting 
research into the development of data management sys- 
tems that can handle the archiving and querying of data 
produced by Earth and space missions. Several unique 
challenges drive the design of these systems, including 
the volume of the data, the use and interpretation of the 
data’s temporal, spatial, and spectral components, the 
size of the userbase, and the desire for fast response times. 

The IDM group has developed an Intelligent Informa- 
tion Fusion System (IIFS) for testing approaches to han- 
dling the archiving and querying of terabyte-sized spatial 
databases (see Figure 2). Major components of the sys- 
tem are the mass storage and its interactions with the 
rest of the system [Camp91]; the real-time planning and 
scheduling for processing the data [Short91]; the extrac- 
tion of metadata and subsequent construction of fast in- 
dices for organizing the data along various search dimen- 
sions [Camp89] [Cromp91] [Dorf91]; and the overall user 
interface. 

The IIFS design is novel in a number of areas. Semantic 
data-modeling techniques are used to organize the mass 
storage system to reduce the transfer times of the data 
to on-line devices and the mechanical motions of the sup- 
porting robotics. Data percolates from near-line mass 
storage to on-line disk storage based upon its frequency 
of use. A combination of neural networks and expert sys- 
tems defines how metadata is extracted to build up search 
indices to the underlying database. The metadata itself is 
organized in an object-oriented database which has spe- 
cial data structures for representing the multiple views of 
the data (such as temporal, spatial, spectral, project, sen- 
sor) without resorting to multiple copies of information. 
A special data structure that maps directly between the 
Earth and a sphere organizes the data for efficient spatial 
querying. The user interface is configured dynamically at 
run-time depending on the scientist’s discipline and the 
current knowledge in the object database. 

Experimentation with the IIFS design and implemen-
tation has shown that greater flexibility is needed in the
spatial data handling routines so that images with a vari- 
ety of coverage and orientation can be uniformly retrieved 
with respect to a user’s region of interest. The remainder 
of the paper discusses the enhancements that have been 
made to the IIFS spatial data structures and describes an 
overall spatial data handling system that combines declar- 
ative and procedural knowledge for efficiently managing 
spatial queries. 
Figure 2: The high-level architecture of the Intelligent Information Fusion System.


1.4 A solution 

A design for a spatial data structure suitable for a large, 
heterogeneous image database with global coverage must 
account not only for the goals of Section 1.1, but also 
the difficulties introduced by the richness of the remote 
sensing domain: 

• Multiple image orientations, due both to different
satellite orbits, and because there is no such thing 
as “fixed orientation” on the surface of a sphere. 

• Multiple image shapes, due to the variety of sensors, 
the tilt of the individual spacecraft, and the alter- 
ation of the image border by geometric correction. 

• Multiple image sizes in terms of the extent of the 
image boundaries on the surface of the sphere: e.g., 
sensors mounted on airplanes have smaller fields of 
view than similar sensors mounted on orbiting plat- 
forms. 

The data structure described in this paper, together 
with the supporting expert systems for ingest and query- 
ing, addresses all these concerns. The result is a spatial 
data handling system which can handle NASA’s next gen- 
eration of image catalogs. 


2 Simplifying spatial queries by a 
transformation scheme 

2.1 The general concept 

A variety of spatial data handling problems in complex
spatial domains can be solved by mathematically trans-
forming the domain D into a new domain D' where the
corresponding queries can be handled more efficiently
[Same90, p. 186]. Such transformations map a complex
object in D (in this case, an image) into a single point in
D': this point is referred to as the object’s representative
point. We are then left with the simpler goal of designing
a data structure that can handle the storage and query
of points rather than arbitrary shapes. Two difficulties
with this approach can be encountered:

• A query region R in D must be transformed into
its equivalent R' in D', and R' may be difficult to
generate or to calculate with, even for simple R.

• The transformation may result in some loss of infor- 
mation about the stored objects, so that additional 
computation may be needed to exactly satisfy a spa- 
tial query. 

These difficulties are dealt with in Section 4, where the 
implementation of the data structure is described. 







Figure 3: The minimal bounding circle of an image on the 
globe. Note that the radius is measured along a great- 
circle arc, like all distances on the surface of a sphere. 


2.2 A transformation scheme for image 
data 

In order to transform images into points, we discard the 
actual boundaries of the image and concern ourselves only 
with its minimal bounding circle, which we shall call the 
representative circle for the image (Figure 3). This is 
closely related to the approach taken by [Oost90], which 
takes the minimal bounding circles of objects on a planar 
surface instead of on a spherical surface. Note that the 
“representative circle” approach eliminates the problems 
of multiple image shapes and orientations. 

By treating images as circles, we are able to describe
every image by only two parameters: the location of the
circle’s center, which we shall denote as σ, and the radius
of the circle, which we shall denote as ρ. Thus, every
image can be treated as a simple point (σ, ρ). Under the
terminology of [Hinr83], σ is the point’s location parame-
ter, and ρ is the point’s extension parameter.
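
The following sketch illustrates how a location parameter (the circle's center) and an extension parameter (its radius) might be derived from the corner coordinates of an image. It is not the procedure used in the IIFS: the center is taken as the normalized mean of the corner vectors, which yields a bounding circle but not necessarily the minimal one, and the function names and the latitude/longitude input format are assumptions made for illustration.

```python
import math


def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a 3-D unit vector."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def angular_distance(u, v):
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def bounding_circle(corners):
    """Return (center, radius) of a bounding circle for the given corners.

    Assumption: the center is the normalized mean of the corner vectors,
    so the circle encloses all corners but is not guaranteed to be minimal.
    The corners must not cancel out (e.g., an antipodal pair).
    """
    vecs = [to_unit_vector(lat, lon) for lat, lon in corners]
    sx = sum(v[0] for v in vecs)
    sy = sum(v[1] for v in vecs)
    sz = sum(v[2] for v in vecs)
    norm = math.sqrt(sx * sx + sy * sy + sz * sz)
    center = (sx / norm, sy / norm, sz / norm)
    radius = max(angular_distance(center, v) for v in vecs)
    return center, radius


# A hypothetical 2-degree-square image centered on latitude 0, longitude 0.
center, radius = bounding_circle([(1, 1), (1, -1), (-1, -1), (-1, 1)])
```

Note that the radius is measured in radians along a great-circle arc, in keeping with Figure 3.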

2.3 Visualizing the transformation 

Consider a part of the globe over which several images 
have been taken, shown in Figure 4. For illustration pur- 
poses we will show only a small part of the globe so that 
it may be rendered as a simple plane, although it must be 
stressed that what is actually being shown is a portion of 
a curved surface. This would represent a scenario in D. 

To map this scenario to D', we compute for each image
Iᵢ the center σᵢ and radius ρᵢ of its minimal bounding
circle, and plot the resulting point (σᵢ, ρᵢ) in D' as shown
in Figure 5.

Figure 4: A portion of D, showing a group of images and
their minimal bounding circles.

To more compactly illustrate what the space of D' looks 
like, we must make some diagrammatic simplifications. 
Figure 6 shows how the surface of a sphere can be mapped 
onto the perimeter of a circle by means of a space-filling
curve. This is a single curve that begins in the diagram at
point A, passes through every point B, C, D, etc. on the
sphere, and eventually returns to A (also labeled Z in the
diagram). The curve places an ordering on the points: A 
is before B, B is before C, etc., and this ordering enables 
us to place every point on the sphere’s surface onto the 
perimeter of the circle below. Note that points which are 
close to each other on the circle (like B and C) correspond 
to points which are close to each other on the sphere. 


By using this mapping, the scenario of Figure 5 is de-
picted again in Figure 7. Here, D' is shown as the surface
of a cylinder: the position on the vertical axis represents
the ρ value, and the position along the circular perimeter
represents the σ value.

Since individual sensors can be expected to produce 
large numbers of images of the same size, we expect the 
distribution of representative points for a large, hetero- 
geneous image database not to be uniform, but instead 
to be concentrated in different strata along the ρ axis
(Figure 8). 
Figure 5: The representative points of the images in Fig-
ure 4, plotted in a portion of D'.



Figure 6: How the surface of a sphere (above) can be 
mapped onto the perimeter of a circle (below) by using 
a space-filling curve A, B, ..., Y, Z. For clarity, the curve
on the sphere is not shown in its entirety. 



Figure 7: The same representative points as in Figure 5,
this time plotted on a cylinder to represent D' more com-
pactly. Every circular cross-section of this cylinder rep-
resents the entire surface of a sphere (the globe).


Figure 8: The expected distribution of representative
points in D'. For convenience, ρ is shown on a logarithmic
scale. Labels recovered from the figure: aerial imagery
(7.5’ photoquads); SPOT images; MSS/TM images; CZCS
images; AVHRR images; full hemispheres; (Space Shuttle,
etc.).


Figure 9: Criteria for a representative circle to overlap 
R. Both circles have the same radius, ρ₁ = ρ₂ = r, but
different locations. 


3 Processing queries in the trans- 
formed space 

3.1 Processing window queries 

Given a query region R on a sphere, we note that the
further the center of a circle C is from R, the larger the
radius of C must be if C is to overlap R. Let grow(R, r)
denote the locus of all points that are within a distance
of r from R: read this as “grow R by radius r”. A sample
R and grow(R, r) are depicted in Figure 9. We observe:

A representative circle Cᵢ = (σᵢ, ρᵢ) overlaps a
region R if and only if its center falls inside
grow(R, ρᵢ).

This rule is demonstrated in Figure 10, which depicts in
D a query region R, the representative circles for four im-
ages C₁ ... C₄, and the region grow(R, ρᵢ) for the various
image radii ρᵢ. Note that:

• σ₁ is not inside grow(R, ρ₁), and σ₃ is not inside
grow(R, ρ₃). Therefore, neither C₁ nor C₃ overlaps
R.

• σ₂ is inside grow(R, ρ₂), and σ₄ is inside grow(R, ρ₄).
Therefore, both C₂ and C₄ overlap R.

Now, consider Figure 11. It depicts in D' the represen-
tative points Cᵢ = (σᵢ, ρᵢ) for the images in Figure 10.
For each point the corresponding region grow(R, ρᵢ) has
been plotted, on the same cross-section of D' where Cᵢ
resides. Notice that, if grow(R, r) had similarly been plot-
ted for all r in ρ, the regions would trace out a cone-like
solid in D'. In terms of formulating window queries in
D', this means that:

An image’s circle in D overlaps a region R if and
only if its representative point in D' falls within
the cone in D' whose cross-section at ρ = r is
grow(R, r).

If the region R is a single point p, then this becomes
the definition for a point query, where grow(p, r) is simply
a circle with center p and radius r.
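
For the point-query case the rule reduces to a single distance comparison, which can be sketched as follows. The sketch assumes that centers and query points are given as 3-D unit vectors and that radii are in radians; the function names are illustrative only.

```python
import math


def angular_distance(u, v):
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def overlaps_point(center, radius, query_point):
    """True if the representative circle (center, radius) overlaps the point.

    Equivalent to asking whether the center falls inside grow(p, radius),
    i.e. within a circle of that radius around the query point p.
    """
    return angular_distance(center, query_point) <= radius


# Hypothetical image centered on the x-axis with a radius of 0.1 radians.
image_center = (1.0, 0.0, 0.0)
near = (math.cos(0.05), math.sin(0.05), 0.0)   # 0.05 rad from the center
far = (math.cos(0.2), math.sin(0.2), 0.0)      # 0.2 rad from the center
```

A general window query would replace the distance to a point by the distance from the center to the nearest point of R, which is not shown here.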

[Same90, pp. 187-192] observes that, when employing 
transformation schemes which represent stored objects by 
points that have distinct location and extension param- 
eters, window queries and containment queries generally 
produce cone-like search regions. This is also true of the 
model described above (Figure 12), and for this reason 
we refer to the search regions in D' as search cones.

3.2 Processing containment queries 

Up to this point we have dealt with window queries, which
produce cone-like search spaces in D'. A containment
query’s search space is also a cone-like region, but differ-
ing in the way a cross-section of the cone is defined for
a given value r of ρ: instead of its being the locus of all
points p such that any point of R is within a radius of r
from p, it is the locus of all points p such that all points of
R are within a radius of r from p. Call this cross-section
cover(R, r).
Figure 10: Four representative circles, as they would appear
in D. Also shown are grow(R, ρᵢ) for each circle Cᵢ.



Whereas grow(R, r) is relatively easy to compute even
on a sphere, cover(R, r) is more complex. But, as it turns
out, we need never compute cover(R, r) directly to pro-
cess queries.

To begin, notice that since an image with radius π/2
is a full hemisphere, we need not concern ourselves with
images where r > π/2. It can be shown that for all r <
π/2, if all the vertices of R are within a radius of r from
a given point p, then the edges between those vertices are
completely within a radius of r from point p, and thus all
of R is within a radius of r from point p. So cover(R, r)
is actually the locus of all points within a radius of r
from every vertex of R. Therefore, if R has n vertices,
cover(R, r) is the intersection of n circles of radius r whose
centers are at the vertices of R (Figure 13).

Let R have vertices v₁ ... vₙ. In terms of containment
queries, this means that:


An image’s circle in D completely contains a
region R if and only if its representative point
in D' falls inside all the search cones S₁ ... Sₙ,
where the cross-section of Sᵢ at ρ = r is
grow(vᵢ, r).


The search cone for the containment query can thus be 
defined as the intersection of n search cones: the search 
cones for the point queries on the n vertices of R (Fig- 
ure 12). 
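
The containment rule can be sketched as a conjunction of point queries on the vertices of R. The sketch assumes unit-vector coordinates, radii in radians, and a region whose edges are great-circle arcs; it rejects radii of π/2 or more, following the restriction to r < π/2 stated above. The function names are illustrative.

```python
import math


def angular_distance(u, v):
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def contains_region(center, radius, vertices):
    """True if the circle (center, radius) contains every vertex of R.

    For radius < pi/2 this implies that the whole of R is contained,
    since the edges between contained vertices are contained as well.
    """
    if radius >= math.pi / 2:
        raise ValueError("rule is only stated for radius < pi/2")
    return all(angular_distance(center, v) <= radius for v in vertices)
```

For example, a circle of radius 0.3 centered at (1, 0, 0) contains a region whose vertices all lie within 0.2 radians of that center, while a circle of radius 0.15 at the same center does not.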


4 The hypercylinder data struc- 
ture 

4.1 Design issues arising from implemen- 
tation details 

At the core of the spatial data handling system is the data 
structure that stores and retrieves points in D', named
the hypercylinder because of the shape of the transformed 
space. To ensure that it is capable of efficiently processing 
queries in D, we must consider factors that place practical 
limitations on how the corresponding search regions in D' 
can be manipulated. 

Although the cross-section of a search cone at a given 
value of p is easy to generate, computations involving the 
cone itself require a great deal more processing. There- 
fore we shall handle queries by dividing the search cones 
into cross-sections that may be dealt with individually 
(Figure 11 provides an illustration of “slices” of a search 
cone). The ramification for the data structure is that 
D' must be represented internally as a collection of slices
that can be queried independently. 

Since we must still compute cross-sections for each slice 
of the data structure, we need to divide D' into a man-
ageable number of slices. If slices are infinitely thin (i.e.,
the data points in a given slice all have the same ρ value),
then even small variations in the extents of images will 
result in a need for a large number of slices. We thus let 
each slice cover a range of p values. 

Figure 11: The representative points for the circles in Figure 10, as they would appear in D'. Also shown are
grow(R, ρᵢ) for each circle Cᵢ.

Figure 12: The search cones (shaded) in D' for a point query on p, a window query on R, and a containment
query on R. Notice that the search cone for the containment query is the intersection of four search cones for
point queries.

Figure 13: A region R and cover(R, r). Any circle of radius r whose center lies in cover(R, r) will overlap all
points in R.

Figure 14: Slicing up D', and approximating the portion of a search cone inside the slice from ρ = s to ρ = t. By
computing the interior and exterior boundaries of the search cone in that slice, we can divide the points in that slice
into three groups - those that definitely satisfy, possibly satisfy, and definitely do not satisfy the query.

Figure 15: The hypercylinder data structure. Square nodes belong to the BST for ρ, circular nodes belong to the
SQTs for σ; white nodes are internal nodes, and black nodes are leaf nodes. A close-up of one slice is depicted, with
n leaf nodes in its SQT. Also shown is how some of the SQT leaf nodes (numbered) might look if the surface of the
slice were “unrolled” (bottom), and how the corresponding trixels might look on the surface of a sphere (right).

To execute a query (Figure 14), we handle one slice
of the search cone at a time, and then merge the results
together to form the final result. For each slice, we com-
pute two cross-sections of the query’s search cone: one
where the cone passes through the top of the slice, and
one where it passes through the bottom of the slice. These
two cross-sections give us, respectively, the interior and
exterior boundaries of the search cone as it passes through
the slice. We observe that:

• Data points from inside the interior boundary are 
definitely inside the search cone. 

• Data points from between the interior and exterior 
boundaries are possibly inside the search cone, and 
must be tested on an individual basis. 

• Data points outside the exterior boundary are defi- 
nitely outside the search cone. 
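
The three cases can be sketched for a point query as follows. The sketch assumes that a slice covers radii from s to t with s ≤ t, that the cross-section at s gives the interior boundary (the cone widens as the radius grows), and that each stored point is supplied together with the precomputed great-circle distance from its center to the query point. The tuple layout and function names are hypothetical.

```python
def classify(dist, s, t):
    """Classify a point in the slice [s, t] by its distance to the query point."""
    if dist <= s:
        return "definite"   # inside the interior boundary
    if dist > t:
        return "reject"     # outside the exterior boundary
    return "possible"       # between the boundaries: must be tested


def point_query_slice(points, s, t):
    """Return the ids of images in one slice that overlap the query point.

    points: iterable of (image_id, dist, radius), where dist is the distance
    from the image's center to the query point and s <= radius <= t.
    """
    hits = []
    for image_id, dist, radius in points:
        verdict = classify(dist, s, t)
        if verdict == "definite":
            hits.append(image_id)
        elif verdict == "possible" and dist <= radius:
            hits.append(image_id)   # individual test succeeded
    return hits


# Hypothetical slice covering radii from 0.1 to 0.3 radians.
slice_points = [
    ("a", 0.05, 0.20),   # definitely satisfies
    ("b", 0.20, 0.25),   # possibly satisfies; individual test passes
    ("c", 0.20, 0.15),   # possibly satisfies; individual test fails
    ("d", 0.40, 0.30),   # definitely does not satisfy
]
hits = point_query_slice(slice_points, 0.1, 0.3)
```

Only points "b" and "c" require the individual test, which is the cost that thin slices are intended to reduce.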

Notice that the thicker the slice, the greater the dif-
ference between the interior and exterior boundaries, and
hence the more data points we can expect to have to test
in this region - which we shall call the “possibly-satisfy”
region - during a query. To maximize query efficiency, we
must slice up the hypercylinder so that areas of D' with
many data points are sliced thin, while areas of D' with
very few data points may be covered by thick slices since
fewer points will need to be tested in those areas. As
revealed in Figure 8, we expect the representative points
for images to be largely concentrated in different “strata”
of D'. Unfortunately we cannot predict where all such
strata will eventually lie, due to the continuous launching
of new sources of image data.

As the number of points within a slice grows, eventu- 
ally the density of points within that slice is such that ex- 
cessive time is spent deciding whether to accept a point 
within the “possibly-satisfy” region. At this time, the 
slice must be split so that the collection of points within 
the subslices is more homogeneous. A heuristic approach 
to recognizing when this division should occur and where 
the division should be made is given in Section 5.1. 

4.2 The data structure design 

For an overview of the hypercylinder’s design, refer to
Figure 15. The top-level view is a binary search tree
(BST), whose branches discriminate between values of ρ
and whose leaves are the slices of D'. The data struc-
ture at each leaf is a sphere quadtree (SQT) [Feke84]
[Feke90], a special variation of a quadtree designed for
storing and retrieving points distributed on the surface of
a sphere. The branches of a SQT discriminate between
values of σ and the leaves represent triangular regions of
the globe (Figure 16). The representative points of im-
ages are stored in the leaves of each SQT.
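
The two-level organization can be sketched in a simplified form. In the sketch below a sorted list of split values searched by bisection stands in for the BST, and a plain list per slice stands in for the sphere quadtree; both substitutions, and the class and method names, are assumptions made to keep the example short.

```python
import bisect


class Hypercylinder:
    """Simplified stand-in: slices selected by radius, points stored per slice."""

    def __init__(self, boundaries):
        # k split values on the radius axis define k + 1 slices.
        self.boundaries = sorted(boundaries)
        self.slices = [[] for _ in range(len(self.boundaries) + 1)]

    def slice_index(self, radius):
        """Index of the slice whose radius range includes the given radius."""
        return bisect.bisect_right(self.boundaries, radius)

    def insert(self, image_id, center, radius):
        """Store a representative point (center, radius) in its slice."""
        self.slices[self.slice_index(radius)].append((image_id, center, radius))


# Hypothetical split values at 0.01 and 0.1 radians give three slices.
h = Hypercylinder([0.01, 0.1])
h.insert("img1", (1.0, 0.0, 0.0), 0.05)
```

In the design described here each per-slice list would be replaced by an SQT keyed on the center of the representative circle.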

Figure 16: How a sphere quadtree divides the globe into
triangular patches (called trixels). The higher the level
number of a trixel, the deeper in the tree it is, and the
smaller the area it covers.

The sphere quadtree is a unique data structure in
that it models the globe without introducing distortions
or discontinuities, as other approaches such as latitude-
longitude based schemes do (see Section 1.2). Conceptu-
ally, it divides the sphere into twenty identical equilateral
triangles called trixels, where each trixel is a “bucket” for
data points. When a trixel reaches its threshold num-
ber of data points, it is split into four nearly equal-area
subtrixels. This subdivision is called refinement since it
produces smaller trixels which, like the pixels in an im-
age, can represent regions to a higher degree of resolu-
tion. As with most spatial data structures, refinement in
a SQT can continue indefinitely: the result is that areas
of the globe that are densely populated are more refined,
so query regions in those areas are more accurately rep-
resented by the higher resolution trixels in the SQT.
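
One common way to refine a spherical triangle into four subtriangles is to connect the midpoints of its edges, projected back onto the sphere. The sketch below assumes that scheme; it is offered as an illustration of refinement rather than as the exact construction of [Feke84], and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def midpoint(u, v):
    """Midpoint of the great-circle arc between two unit vectors."""
    return normalize(tuple(a + b for a, b in zip(u, v)))


def refine(trixel):
    """Split a trixel (three unit vectors) into four subtrixels."""
    a, b, c = trixel
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [
        (a, ab, ca),     # corner subtrixel at a
        (ab, b, bc),     # corner subtrixel at b
        (ca, bc, c),     # corner subtrixel at c
        (ab, bc, ca),    # central subtrixel
    ]


# An octant of the sphere is used here only as a simple test triangle;
# the SQT itself starts from the twenty faces of an icosahedron.
octant = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
subtrixels = refine(octant)
```

Applying `refine` again to any subtrixel yields the next level of resolution.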

Since satellite orbits generally provide global coverage, 
we expect the SQT for each slice to be fairly equally re- 
fined over most of the surface of the globe, i.e., the SQT 
is well balanced, and so spatial queries are handled with 
similar efficiency regardless of their location. But since 
satellite orbital paths are often designed to produce fully- 
overlapping images at each pass over a location, a clus- 
tering of the representative points occurs and results in 
a tree that is globally well-balanced, but locally unbal- 
anced. A means for overcoming this problem exists, and 
is discussed in Section 7.1. 

To ensure that the slices are split in an optimal man- 
ner when they achieve their threshold number of data 
points, a profile of how the data points are distributed 
in each slice is maintained. These profiles are used as 
heuristic devices to determine where the slices should be 
subdivided. Their actual implementation is described in
Section 5.1. 

5 A spatial data management 
system 

The primary motivation for designing an entire spatial 
data management system for the remote sensing domain, 
as opposed to just the custom-tailored spatial data struc- 
ture described above, is that domain knowledge can often 
be employed to improve the overall performance of any 
data management scheme. For a remote-sensing catalog, 
such knowledge encompasses: 

• Model information: what real-world entities and con- 
cepts (observations, sensors, geographic regions, sci- 
entific parameters, classification schema) are repre- 
sented in the catalog, and what sorts of questions 
may be asked about them by end-users. This is 
mostly declarative information, intended for use by 
both the end-users and the system. It allows the 
users to ask the system about its contents, and it en-
ables the system to translate users’ natural-language
and graphical queries into the system’s internal rep- 
resentation. 

• Data structure information: what data structures 
(e.g., the hypercylinder) exist in the database, under 
what conditions they should be used (e.g., spatial 
queries), and how data is distributed in them (e.g., 
the profiles mentioned in Section 4.2). This proce- 
dural and declarative metaknowledge is used by the 
catalog to construct plans for queries, to generate 
the necessary calls to the catalog’s underlying data- 
base management system, and to optimize the query 
plans as intermediate results are returned. 

• Operational information: the performance of the
hardware devices over which the database is dis-
tributed, the anticipated system loads over the
course of a typical day or week, the types of queries 
most frequently made, etc. This is largely declara- 
tive information, used by the catalog in performing a 
variety of tasks ranging from query optimization to 
automatic data structure reorganization. 

The spatial data management system makes use of such 
information in its two supporting expert systems: the 
Spatial Ingest Expert System and the Spatial Query Ex- 
pert System, both of which are discussed below. 


5.1 The Spatial Ingest Expert System 
(SIES) 

The primary function of the SIES is to govern the splitting 
of the slices of the hypercylinder, ensuring that each slice 
is divided so that dense strata of D' end up in thin slices,
with thick slices covering the sparser expanses of D'. As
mentioned in Section 4.2, each slice maintains a profile of 
the current distribution of the data points in the slice. In 
addition to this, the SIES incorporates information about 
the expected future distribution of points. Both provide 
a heuristic means of optimizing the hypercylinder as it is 
being built. 

The primary requirements for profiles are that they 
must be easy to update during ingest, be implemented to 
allow rapid calculation during splitting, be large enough 
to adequately capture the distribution of points in a slice, 
and be small enough not to incur a large storage overhead. 
We have experimented with an approach based on incre- 
mental sampling of the representative points as they are 
stored in the slice: the profile consists of a small reservoir 
of the ρ values of sampled points, and as each new point
is ingested there is a chance that one of the current ele-
ments of the reservoir will be replaced by the ρ value of
the new point. During splitting, the profile is analyzed
to determine where the ρ values are clustered, indicating
emerging strata in D'.
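
A profile of this kind can be sketched with the standard reservoir sampling technique (often called Algorithm R), which keeps a fixed-size uniform sample of a stream. Whether the IIFS profiles use exactly this replacement rule is not stated, so the rule, the class name, and the capacity below are assumptions.

```python
import random


class SliceProfile:
    """Fixed-size uniform sample of the radii ingested into a slice."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.seen = 0            # number of points ingested so far
        self.reservoir = []      # sampled radius values
        self.rng = rng or random.Random()

    def observe(self, radius):
        """Update the profile as a new representative point is ingested."""
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(radius)
            return
        # Replace a random element with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.reservoir[j] = radius


# Hypothetical usage: a five-element profile over 100 ingested radii.
profile = SliceProfile(capacity=5, rng=random.Random(0))
radii = [i / 100 for i in range(100)]
for r in radii:
    profile.observe(r)
```

Each update costs constant time and the storage overhead is bounded by the capacity, which matches the requirements listed above.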

The profiles are supplemented in the knowledge base 
by the bias list: a list of strata into which D' is expected
to be organized. The bias list is updated whenever a 
new instrument is added to the knowledge base, and con- 
tains the expected minimum and maximum radii of im- 
ages that the instrument will generate plus an estimate 
of how many images will be generated over the lifetime 
of the instrument. Although the bias list can supply in- 
formation on where a slice is best split (or even whether 
to defer splitting a slice), its contents do not reflect the 
actual state of the data structure, and thus neither it nor 
the profiles are expected to provide maximal performance 
in isolation. 

The hypercylinder initially consists of a single slice cov-
ering all of ρ. When the number of points stored in any
slice reaches the threshold value for splitting, a strategy 
for dividing the slice is formulated from one of several 
alternatives, such as: 

• Place the largest cluster into its own slice, and the 
spaces to either side of this cluster into two additional 
slices (Figure 17). 

• Place the largest expanse of sparsely-populated space 
into its own slice, and the spaces to either side of it 
into two additional slices. 

• Given an entry in the bias list whose minimum and
maximum radii fall within the slice and whose esti-
mated number of images is high, place that range of
values into its own slice and the spaces to either side
of it into two additional slices.

• Split the slice so that equal numbers of points are in
each subslice.

Figure 17: One strategy for splitting a slice, based on the distribution of points in the slice: this “fences in” the
largest cluster of points, putting it into its own slice.

Figure 18: How non-spatial components of a query can place implicit spatial constraints. Only the shaded layers of
the hypercylinder must be searched.

The strategy chosen depends on factors such as how 
definite the clusters are, how widely they are distributed, 
and whether one cluster is significantly larger than the 
others. Each fact lends weight to one or more of the 
strategies during selection: these weights are then 
adjusted as more is learned about the performance of the 
SIES in a real-world environment. 
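
Two of the splitting strategies listed above (fencing in the largest cluster, and the equal-count split) can be sketched as follows. This is an illustrative Python sketch rather than the SIES rule base: the function names, the gap-based definition of a cluster, and the `gap` parameter are assumptions introduced here, and the selection weights are not modeled.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def largest_cluster(values: List[float], gap: float) -> Interval:
    """Bounds of the most populous run of sorted values whose neighbours
    differ by at most `gap` (an assumed, simplistic cluster rule).
    Assumes `values` is non-empty."""
    vals = sorted(values)
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(vals) + 1):
        # A run ends at the end of the data or at a gap wider than `gap`.
        if i == len(vals) or vals[i] - vals[i - 1] > gap:
            if (i - 1) - start > best_end - best_start:
                best_start, best_end = start, i - 1
            start = i
    return vals[best_start], vals[best_end]


def fence_in_split(lo: float, hi: float, values: List[float],
                   gap: float) -> List[Interval]:
    """Give the largest cluster its own slice; the space on either side
    becomes up to two additional slices."""
    c_lo, c_hi = largest_cluster(values, gap)
    pieces = [(lo, c_lo), (c_lo, c_hi), (c_hi, hi)]
    # Drop a side slice that would be empty because the cluster
    # touches the edge of the slice being split.
    return [p for k, p in enumerate(pieces) if k == 1 or p[0] < p[1]]


def equal_count_split(values: List[float], parts: int) -> List[float]:
    """Cut points leaving roughly equal numbers of points per subslice.
    Assumes len(values) >= parts."""
    vals = sorted(values)
    cuts = []
    for k in range(1, parts):
        idx = k * len(vals) // parts
        cuts.append((vals[idx - 1] + vals[idx]) / 2)
    return cuts
```

For example, with hypothetical radii `[1.0, 1.1, 1.2, 5.0, 9.0, 9.1]` in a slice spanning 0.0 to 10.0 and `gap=0.5`, `fence_in_split` isolates the cluster from 1.0 to 1.2 and returns the three slices `(0.0, 1.0)`, `(1.0, 1.2)`, and `(1.2, 10.0)`.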

5.2 The Spatial Query Expert System 
(SQES) 

The SQES is actually a conceptual subset of the larger 
Query Processing Expert System, a mostly-procedural- 
knowledge base whose content is the model and data 
structure information described in Section 5, and whose 
purpose is the translation, optimization, and execution of 
user queries. The SQES handles those parts of the task 
that relate to spatial searches. 

The first place the SQES is invoked is during the pars- 
ing of queries with symbolic spatial components. In 
natural-language and menu-driven queries, such compo- 
nents might appear as the names of geographic, political, 
or climatological regions on the Earth (or as the names of 
stellar objects or constellations, depending on the catalog 
type). The SQES translates these terms into geometric 
region descriptions, possibly invoking external informa- 
tion sources in the process, such as databases that house 
geopolitical boundaries, or that store the names of as- 
tronomical entities under different labelling schemes to 
allow translation from one scheme to another (e.g. the 
SIMBAD database). 

Figure 18 shows how the SQES can optimize the spatial 
search by inferring additional spatial constraints from the 
user’s query. The user’s specification of desired ranges of 
image resolution and spectral bands constrains the set of 
instruments that might be sources of the desired data, 
which in turn limits the possible sizes of images that can 
be returned from the user’s query, which in turn pin- 
points the only slices of the hypercylinder that need to 
be searched. 
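
The chain of inference described above (query attributes, then candidate instruments, then image sizes, then slices) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Instrument` record, its field names, and the representation of slices as `(min_radius, max_radius)` pairs are hypothetical and are not taken from the SQES knowledge base.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple

Slice = Tuple[float, float]  # (min_radius, max_radius) covered by a slice


@dataclass(frozen=True)
class Instrument:
    name: str
    resolution: float        # hypothetical resolution figure
    bands: FrozenSet[str]    # hypothetical spectral band labels
    min_radius: float        # expected image radius range
    max_radius: float


def candidate_slices(instruments: List[Instrument],
                     slices: List[Slice],
                     resolution_range: Tuple[float, float],
                     wanted_bands: Set[str]) -> List[Slice]:
    """Return only the slices that could hold images from instruments
    matching the user's resolution and spectral-band constraints."""
    lo, hi = resolution_range
    # Step 1: the non-spatial constraints narrow the set of instruments.
    radius_ranges = [
        (inst.min_radius, inst.max_radius)
        for inst in instruments
        if lo <= inst.resolution <= hi and inst.bands & wanted_bands
    ]
    # Step 2: keep slices whose radius range overlaps any surviving
    # instrument's expected image radius range.
    return [
        s for s in slices
        if any(s[0] <= r_max and r_min <= s[1]
               for r_min, r_max in radius_ranges)
    ]
```

If no instrument satisfies the constraints, the function returns an empty list, which corresponds to a query that needs no spatial search at all.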

The SQES also assists in planning complex spatial 
queries, where the order in which subparts of the query 
are executed can play a dramatic role in decreasing pro- 
cessing time. Consider the processing of a containment 
query: as noted above, containment queries are best han- 
dled as the intersection of a collection of point queries. 
Throughout the system, computing the intersection of a 
group of unknown sets is performed in a strategic manner: 
the members of the group are retrieved sequentially, 
from smallest to largest estimated size, and the most 
recently retrieved set is intersected with a running “result” 
set. The system stops and returns the empty set as the 
result of the intersection if any of the retrieved sets is the 
empty set. 

Figure 19: The profile used by the SQES. Darker trixels 
indicate that observations are more dense in those 
portions of the globe. 
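
The smallest-first intersection described above can be sketched in a few lines. This Python sketch assumes each member of the group is supplied as a pair of an estimated size and a zero-argument `fetch` callable that performs the actual retrieval; that interface, and the function name, are illustrative assumptions rather than the system's own.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

Part = Tuple[float, Callable[[], Set[int]]]  # (estimated size, fetch)


def strategic_intersection(parts: Iterable[Part]) -> Set[int]:
    """Intersect lazily retrieved sets, smallest estimate first,
    stopping as soon as any retrieved set is empty."""
    result: Optional[Set[int]] = None
    for _, fetch in sorted(parts, key=lambda part: part[0]):
        current = fetch()          # retrieval happens only when needed
        if not current:
            return set()           # early exit: remaining sets never fetched
        result = set(current) if result is None else result & current
    # Assumption: an empty group yields an empty result.
    return result if result is not None else set()
```

Ordering by estimated size means that an empty or very small set tends to be retrieved first, so the more expensive retrievals are often skipped entirely.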

To allow the SQES to estimate the relative sizes of sets 
returned by the components of a spatial query, yet an- 
other profile is kept in the knowledge base. Whereas the 
previously-discussed profiles represent the distributions of 
the image radii, this new profile represents the distribution 
of the image centers on the globe, giving, in effect, the 
“density” of observations around the surface of the globe. 
Since it associates spherical locations with density values, 
this profile is implemented as a small spherical quadtree 
(Figure 19). When the SQES is confronted with a set of 
query regions that must be ordered by expected content, 
the area of each region is computed and multiplied by its 
average density from the profile to produce an estimate 
of the number of data points in the entire region, and it 
is by this estimate that processing order is determined. 
We intend to install similar profile-based approaches for 
aiding the construction of query plans on top of all the 
catalog’s principal data structures. 
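
The ordering rule described above (area multiplied by average density) can be illustrated as follows. In this Python sketch each query region is assumed to arrive with a precomputed area and the density values of the profile trixels it overlaps; the spherical quadtree lookup itself is not modeled, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

# region name -> (area, densities of the profile trixels it overlaps)
RegionInfo = Dict[str, Tuple[float, List[float]]]


def estimated_count(area: float, densities: List[float]) -> float:
    """Estimate the number of data points in a region as its area
    times the average density taken from the profile."""
    return area * (sum(densities) / len(densities))


def processing_order(regions: RegionInfo) -> List[str]:
    """Order query regions from smallest to largest expected content."""
    return sorted(regions, key=lambda name: estimated_count(*regions[name]))
```

A large region over a sparsely observed part of the globe can therefore be scheduled ahead of a small region over a densely observed one.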

6 Results 

The spatial data handling system has been tested using 
a portion of the metadata stored in the Pilot Land Data 
System (PLDS) catalog. Approximately 3,000 records of 
TM, MSS and AVHRR metadata were ingested from a 
flat file into the hypercylinder’s SIES via a C program 
which extracted the appropriate fields for each image’s 
location, boundaries, and primary key. To assess the per- 
formance and extendability of the SIES under different 
implementation languages, it was written both in Quintus 
Prolog and in CLIPS, an expert system shell developed 
by NASA’s Johnson Space Center and capable of being 
linked into a C program and accessed via simple function 
calls. The SIES sends the appropriate ingest requests 
to the hypercylinder “server”: a C++ program contain- 
ing the hypercylinder data structure, accessible through a 
TCP/IP socket on a Sun-4. All of the spatial search rou- 
tines, as well as the profiles for the hypercylinder’s slices, 
were implemented in C++ and reside in the server. Query 
requests are sent to the SQES, a CLIPS/C module that 
uses a small, array-based version of the SQT (called a 
linear SQT) to store the SQES profile. The SQES per- 
forms the necessary query planning and sends the various 
partial spatial query requests to the hypercylinder server, 
which keeps track of execution times for various tasks. 

The spherical quadtree components of the system and 
the supporting spherical geometry routines have been im- 
plemented and tested independently inside the IIFS cat- 
alog’s database. The catalog uses the Smalltalk-based 
GemStone DBMS, a commercial object-oriented DBMS 
available from Servio Logic Corporation, which is capa- 
ble of invoking external C and C++ functions. 

The rationale for initially implementing the full hy- 
percylinder as an in-core data structure rather than in- 
side the catalog database was twofold. First, it enabled 
us to seamlessly integrate the hypercylinder with the 
C++ spherical geometry objects (such as query regions) 
and routines (such as grow()) necessary for spatial query 
handling. We found C++ to be an excellent program- 
ming platform for rapid data structure prototyping, and 
are currently using Oregon C++ from Oregon Software, 
which conforms to the base ANSI documents for this lan- 
guage and thus should produce highly portable code. Sec- 
ond, initial implementation and testing in core enabled 
us to take CPU-time measurements without concerning 
ourselves with the I/O and CPU overhead that would be 
introduced by interfacing with a DBMS. 

The grow() routine performed well for any given radius: 
the algorithm is O(n), where the inputs are the n vertices 
of the query region R and a radius of expansion r, and the 
outputs are the m vertices of grow(R, r), where n < m < 
cn for a predefined constant c. Unfortunately, the grown 
query region is almost always self-overlapping, and some 
necessary computations (such as determining whether a 
point is inside grow(R, r)) take O(n^2) time using 
our current algorithms. Removing the self-intersections 
from a spherical region appears to be an O(n^2) operation 
in the best case: we are therefore focusing our attention 
on developing more efficient algorithms for manipulating 
the self-overlapping regions. 

7 Future research 

7.1 The hypercylinder 

The hypercylinder data structure, designed to meet strin- 
gent ingest and query requirements for large image cata- 
logs, is nevertheless only one possible data structure and 
is specifically designed for image data. We hope in the 
near future to: 

• Produce additional spatial data structures cus- 
tomized for efficient storage and retrieval of other 
types of observations, such as observations in atmo- 
spheric domains with additional spatial search crite- 
ria such as “altitude.” 

• Implement these data structures fully inside the cat- 
alog’s database, ensuring that the hypercylinder’s 
components are clustered so as to minimize page 
faults during tree traversal. 

• Introduce tree compression techniques for the SQTs, 
as per [Ohsa83], to eliminate the clustering problem 
mentioned in Section 4.2. 

7.2 The Spatial Ingest Expert System 

In future implementations, we plan to expand the role of 
the SIES in the spatial data ingest process. The SIES will 
be empowered to: 

• Periodically survey the data structure for conditions 
that would compromise efficiency, such as tree imbal- 
ance. If such conditions are detected, the SIES must 
determine how best to reorganize the data structure, 
and notify the database administrator (DBA) of the 
problem. 

• Estimate the amount of system resources that a re- 
optimizing step will take, and, based on profiles of 
system loads, suggest to the DBA the best times for 
self-correction. 

• Maintain a history of major decisions affecting the 
data structure: when a slice was split and why, when 
the data structure had to be reoptimized and why, 
etc. Alert the DBA if it is determined that some 
subset of the rules has contributed to poor decisions. 





7.3 The Spatial Query Expert System 

Much of the spatial query optimization is intended to be 
handled by the catalog's proposed Query Planning and 
Execution Module (QPEM), in which the SQES knowl- 
edge will reside. However, there are still spatial search 
strategies unique to the SQES that have yet to be ex- 
plored: 

• Transfer more spatial query processing control from 
the hypercylinder to the SQES. This would involve 
maintaining a collection of density profiles, each cov- 
ering a different slice of the hypercylinder. Spatial 
queries would be handled and optimized indepen- 
dently by each slice of the hypercylinder, based on 
local profile information. 

• Allow the user to specify different levels of spatial 
query processing. Since many stages of query pro- 
cessing in the data structure divide the tree into three 
types of branches - definitely satisfies query, possibly 
satisfies query, and definitely does not satisfy query 
- the user can be given the power to trade precision 
for execution time by deciding to either accept, re- 
ject, or vigorously test the “possibly satisfies query” 
branch. 
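
The precision-for-time trade-off in the last item above can be sketched as follows. This Python sketch assumes that tree traversal has already labelled each branch with one of the three outcomes; the `Match` and `Policy` names and the `exact_test` callback are hypothetical and introduced only for illustration.

```python
from enum import Enum
from typing import Callable, Iterable, List, Tuple


class Match(Enum):
    YES = "definitely satisfies query"
    MAYBE = "possibly satisfies query"
    NO = "definitely does not satisfy query"


class Policy(Enum):
    ACCEPT = "accept"   # fastest, may return false positives
    REJECT = "reject"   # fastest, may miss true matches
    TEST = "test"       # slowest, exact


def resolve(branches: Iterable[Tuple[str, Match]],
            policy: Policy,
            exact_test: Callable[[str], bool]) -> List[str]:
    """Apply the user's chosen policy to the 'possibly satisfies' branches."""
    results = []
    for item, match in branches:
        if match is Match.YES:
            results.append(item)
        elif match is Match.MAYBE:
            # The expensive exact test runs only under the TEST policy.
            if policy is Policy.ACCEPT or (
                    policy is Policy.TEST and exact_test(item)):
                results.append(item)
        # Match.NO branches are always discarded.
    return results
```

Under `Policy.ACCEPT` and `Policy.REJECT` the exact test is never invoked, which is where the saving in execution time would come from.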

8 Summary and conclusions 

The research presented in this paper is intended to serve 
as the foundation for a new generation of spatial data 
management systems at NASA, tailored for the general 
remote-sensing domain and robust enough to support ef- 
ficient spatial searches regardless of the shape or location 
of the user's area of interest. 

The hypercylinder's two controlling expert systems, the 
SIES and the SQES, are as necessary as they are novel. 
By using rule firings in a supervisory expert system to 
activate data management tasks, the conditions under 
which different data management strategies are employed 
can be easily monitored, evaluated, and altered to fine- 
tune system performance. This is a major step beyond 
conventional catalog schemes, where inspection and eval- 
uation of the underlying data structures are at best ex- 
tremely difficult, and adjustment of the associated algo- 
rithms is traditionally impossible without down-time for 
code recompilation. 